9 - Transcription Factor Activity - How to score & interpret them

Author

Marc Elosua Bayes

Published

March 8, 2024

Introduction

scRNA-seq data returns individual molecular reads for each cell representing the expression of each gene in each cell. However, transcript abundances at the individual gene level can be hard to interpret. Another confounding factor with these readouts is the high sparsity of the data. This sparsity acutely affects genes with low mRNA abundance (Figure 1) Mereu et al. (2020) . Transcription factors (TFs) are key players involved in regulating the present and future cell states by binding to regulatory regions in the DNA and driving gene expression programs Baskar et al. (2022). Therefore, they are tightly regulated and are often found at low abundances due to their powerful effects on the cells. Hence, being able to quantify the activity of TFs in a cell can provide very valuable information when characterizing the biological processes underlying a cell type or state. However, due to their low expression they severely suffer from dropout events and their mRNA abundance can’t be accurately quantified by looking at the number of UMIs. To address this issue methods have been developed to quantify their activities by leveraging the expression of the genes they regulate.


Figure 1. Dropout Probability vs Expression level

In this vignette we are going to go over the best practices on how to compute these activites, what we need to take into account when computing them and follow a recent benchmarking paper to determine which are the best reference databases Müller-Dott et al. (2023) and tools to use like decoupler Badia-i-Mompel et al. (2022)!

Before we start here are some key concepts that will help us and frame the vignette!

  • What is a transcription factor?

    Transcription factors are broadly understood as proteins that bind to regulatory regions of the DNA acting as key regulators of gene-expression programs Baskar et al. (2022).

  • What information do we need to compute the activity of a TF?

    The activity of TF is scored based on the expression of the genes it regulates. Therefore, we need a database that contains which genes are regulated by each transcription factor and the relation between them. Some TF can activate the expression of some genes and repress the expression of others. There are many databases that contain this information and in this vignette we aim to provide the current state of the art databases to use, this can be considered as a gene regulatory network (GRN).

  • Do transcription factors act the same in all cell types?

    No! This is crucial to keep in mind when interpreting TF activities. If we take as an example Blimp1 (PRDM1) a well characterized TF in B and T cells it has been shown to have very different functions. In B cells, Blimp1 drives plasmablast formation and antibody secretion, whereas in T cells, Blimp1 regulates functional differentiation, including cytokine gene expression. Studies have determined both conserved and unique functions of Blimp1 in different immune cell subsets such as the unique direct activation of the igh gene transcription in B cells and a conserved antagonism with BCL6 in B cells, T cells, and myeloid cells Nadeau and Martins (2022). This is important to consider because ideally we would have gene-specific GRNs. These can be obtained from multiome datasets but most of the time we don’t have this information, hence reference GRNs are available for these cases and this is what is going to be covered in this vignette.

  • How do we score them in our dataset?

    There are many ways to score the activation of TFs as shown in the decoupler paper Badia-i-Mompel et al. (2022). However, they do not all perform the same and it is important to select a robust method. The suggested method after their benchmarking analysis is running a Univariate Linear Model (ULM) where the gene expression values are the response variable and the regulator weights in the gene signature are the explanatory one (don’t worry, we’ll go through this in more detail in a second). The obtained t-value from the fitted model is the activity score of that gene signature in that cell.

  • How do we interpret the activity obtained?

    Scoring gene signatures using Univariate Linear Models and using the resulting t-value as the scoring metric allows us to simultaneously interpret in a single statistic the direction of activity (either + or -) and its significance (the magnitude of the score).

  • Can we further interrogate the activity scores obtained?

    Yes! In fact it is very important to look past the score obtained by a cell and into which are the genes driving that activity. Sometimes with TF regulating many genes downstream it could be that just a few genes are contributing to its activity in our dataset. Therefore, if we just stopped at the activity score we could be mislead into thinking that all of the genes downstream of the TF are important when it , usually, is actually only a fraction of them. Moreover, heterogeneous gene expression between two populations can also lead to 2 cells or populations having similar scores for one TF but vastly different genes gene programs underlying them.

Libraries

Load the libraries and install the packages needed to run this notebook

if (!requireNamespace("Seurat", quietly = TRUE))
    install.packages("Seurat")

if (!requireNamespace("tidyverse", quietly = TRUE))
    install.packages("tidyverse")

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

if (!requireNamespace("remotes", quietly = TRUE))
    install.packages("remotes")
    
if (!requireNamespace("decoupleR", quietly = TRUE))
    remotes::install_github('saezlab/decoupleR')

if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
    BiocManager::install("ComplexHeatmap")

if (!requireNamespace("glue", quietly = TRUE))
    install.packages("glue")

if (!requireNamespace("colorBlindness", quietly = TRUE))
    install.packages("colorBlindness")

if (!requireNamespace("ggpmisc", quietly = TRUE))
    install.packages("ggpmisc")

if (!requireNamespace("circlize", quietly = TRUE))
    install.packages("circlize")

library(Seurat)
library(tidyverse)
library(decoupleR)
library(ComplexHeatmap)
library(colorBlindness)
library(ggpmisc)

# Remember to set a seed so the analysis is reproducible!
set.seed(687)

Load data

We’re going to be working with a dataset from the paper - Immunophenotyping of COVID-19 and influenza highlights the role of type I interferons in development of severe COVID-19 Download data from cellxgene portal.

# Download the data in data/ directory
# download.file(
#     url = "https://datasets.cellxgene.cziscience.com/d8e35450-de43-451a-9979-276eac688bce.rds",
#     destfile = "../data/workshop-data.rds",
#     method = "wget",
#     extra = "-r -p --random-wait")
# We can also use the CLI with the wget command below
# wget https://datasets.cellxgene.cziscience.com/d8e35450-de43-451a-9979-276eac688bce.rds

se <- readRDS("../data/workshop-data.rds")
# Remove Uncategorized
se <- se[, ! se$Celltype %in% c("Uncategorized1", "Uncategorized2")]
se$Celltype <- as.character(se$Celltype)

# Subset for the sake of speed
se <- se[, sample(colnames(se), 0.25 * ncol(se))]

Generate a color palette for our cell types

# https://www.datanovia.com/en/blog/easy-way-to-expand-color-palettes-in-r/
nb.cols <- length(unique(se$Celltype))
pal <- colorRampPalette(paletteMartin)(nb.cols)
names(pal) <- sort(unique(se$Celltype))

Analysis

Convert ENSEMBL IDs to Gene Symbols

Right away we can see how ensembl ids are used in the rownames. Let’s transform them into their matched symbols to make them human-readable:

head(rownames(se))
[1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" "ENSG00000000460" "ENSG00000000938"

Convert to gene symbols

gene_df <- readr::read_csv(file = "../data/cov_flu_gene_names_table.csv")

symbol_id <- data.frame(index = rownames(se)) %>%
    left_join(gene_df, by = "index") %>%
    pull(feature_name)

# re-create seurat object
mtx <- se@assays$RNA$data
rownames(mtx) <- symbol_id
se <- CreateSeuratObject(counts = mtx, meta.data = se@meta.data)

Preprocessing

We will do a quick preprocessing of the data. 1) log-normalize, 2) identify highly variable genes, 3) scale their expression and 4) compute PCA on the scaled data.

se <- se %>%
    NormalizeData(verbose = FALSE) %>%
    FindVariableFeatures(nfeatures = 3000, verbose = FALSE) %>%
    ScaleData(verbose = FALSE) %>%
    RunPCA(verbose = FALSE)

Next we check the elbow plot to determine the number of PCs to use for the downstream analysis and then compute UMAP, K-nearest neighbor graph (KNN graph) and run Louvain clustering on it.

# Look at elbow plot to assess the number of PCs to use
ElbowPlot(se)

We can see a clear elbow at 10 PCs, we’re going to extend it a bit more and use 15 PCs for the downstream analysis to make sure we are not loosing any biological signal

se <- RunUMAP(se, reduction = "pca", dims = 1:30, verbose = FALSE)

Next we compute the K-nearest-neighbor graph

# se <- se %>%
#     FindNeighbors(verbose = FALSE) %>%
#     FindClusters(resolution = c(0.05, 0.1, 0.125, 0.15, 0.2), verbose = FALSE)
# 
# # Visualize the clustering on the UMAP
# DimPlot(
#     se,
#     group.by = c(
#         "RNA_snn_res.0.05", "RNA_snn_res.0.1", "RNA_snn_res.0.125",
#         "RNA_snn_res.0.15", "RNA_snn_res.0.2"))

For the purpose of this tutorial we are going to proceed with resolution 0.15

(dim_plt <- DimPlot(
    se,
    group.by = "Celltype") +
    scale_color_manual(values = pal))

TF activity scoring

As mentioned in the introduction we need a GRN specifying the relation between a TF and its downstream genes. A recent benchmark has shown how CollecTRI is the best database to estimate TF activities Müller-Dott et al. (2023). CollecTRI contains regulons of signed TF-target genes that have been compiled from 12 different sources, provides increased TF coverage, and in a recent benchmark showed a superior performance in indentifying TF perturbations from gene expression data Müller-Dott et al. (2023). You can see the full explanation in the CollecTRI gihtub page from the Saez lab. We will also follow the decoupler vignette on how to download and score these signatures.

net <- get_collectri(organism = "human", split_complexes = FALSE)
net
# A tibble: 43,178 × 3
   source target   mor
   <chr>  <chr>  <dbl>
 1 MYC    TERT       1
 2 SPI1   BGLAP      1
 3 SMAD3  JUN        1
 4 SMAD4  JUN        1
 5 STAT5A IL2        1
 6 STAT5B IL2        1
 7 RELA   FAS        1
 8 WT1    NR0B1      1
 9 NR0B2  CASP1      1
10 SP1    ALDOA      1
# ℹ 43,168 more rows

Note we are using organism = "human" since we are working with human data - mouse and rat are also available and split_complexes=False since we want to ensure that TF that regulate complexes downstream are scored when the multiple subunits in the complex are present.

In the net dataframe we have the regulation relation between each TF (source) and their downstream genes

# Look at the targets of MYC
net %>% dplyr::filter(source == "MYC")
# A tibble: 888 × 3
   source target   mor
   <chr>  <chr>  <dbl>
 1 MYC    TERT       1
 2 MYC    EPO        1
 3 MYC    ENO1       1
 4 MYC    CDC25A     1
 5 MYC    EMP1       1
 6 MYC    CXCR4      1
 7 MYC    CDKN1A    -1
 8 MYC    TP53       1
 9 MYC    ACTB       1
10 MYC    BRCA1      1
# ℹ 878 more rows
# Look at the weights of MYC
Hmisc::describe(net$mor)
net$mor 
       n  missing distinct     Info     Mean      Gmd 
   43178        0        2     0.35   0.7306   0.4662 
                      
Value         -1     1
Frequency   5816 37362
Proportion 0.135 0.865

From here we can gather that in this database MYC is regulating 886 genes-gene complexes. The weight indicates if MYC activates or represses the activity of a genes. Target genes with values > 0 indicate that MYC promotes the expression and viceversa in those with a weight < 1.

It is important to note that here we are using CollecTRI as a reference database. Therefore, some TF may be missing or the mode of regulation may be different from the expected in a cell type of interest. If you have inhouse curated GRNs obtained from inference methods such as CellOracle, pySCENIC or SCENIC+ you can plug them in here. These have the advantage that you can compute GRNs for your cell types and in turn use them in your scRNAseq dataset!

Activity inference with univariate linear model (ULM)

“Univariate Linear Model (ULM) fits a linear model for each sample and regulator, where the observed molecular readouts in mat are the response variable and the regulator weights in net are the explanatory one. Target features with no associated weight are set to zero. The obtained t-value from the fitted model is the activity ulm of a given regulator.”

Figure 2. Univariate linear model method graphical representation.

Moreover, a nice thing about ulm is that in a single statistic it provides the direction of activity (either + or -) and its significance (the magnitude of the score). Making the scores very easy to interpret!

So lets compute the activity scores for every cell in our dataset!

ulm_start <- Sys.time()
res <- decoupleR::run_ulm(
    mat = se@assays$RNA$data,
    network = net,
    .source = "source",
    .target = "target",
    .mor = "mor")

glue::glue("Time to run ulm is {round(difftime(Sys.time(), ulm_start, units = 'min'), 0)} minutes")
Time to run ulm is 6 minutes

Save data

We can see how every cells has a score for every signature!

 # Looking at the first 10 entries
res
# A tibble: 11,217,279 × 5
   statistic source condition           score   p_value
   <chr>     <chr>  <chr>               <dbl>     <dbl>
 1 ulm       MYC    AAACCCAAGTAAGGGA-18  27.8 2.40e-168
 2 ulm       MYC    AAACCCACACTGGAAG-16  29.6 2.58e-190
 3 ulm       MYC    AAACCCACAGCTGAAG-1   24.8 3.78e-134
 4 ulm       MYC    AAACCCACATGAGATA-5   21.4 2.49e-101
 5 ulm       MYC    AAACCCAGTCTACAGT-14  23.4 2.27e-120
 6 ulm       MYC    AAACCCAGTCTTCGAA-1   27.9 7.09e-170
 7 ulm       MYC    AAACCCAGTGTTAGCT-3   24.5 2.72e-131
 8 ulm       MYC    AAACCCATCAAGCGTT-15  23.8 2.03e-124
 9 ulm       MYC    AAACCCATCACTGTCC-16  10.8 3.80e- 27
10 ulm       MYC    AAACCCATCGCGAAGA-16  23.7 1.71e-123
# ℹ 11,217,269 more rows
# Check the cell type for cell AAACCCAAGGGCAATC-1
ct_b <- "AAACCCACAGCTGAAG-1"
ct_nk <- "ACCAAACTCGGTGTTA-5"
se@meta.data[c(ct_b, ct_nk), "Celltype", drop = FALSE]
                       Celltype
AAACCCACAGCTGAAG-1 B cell, IgG-
ACCAAACTCGGTGTTA-5      NK cell
# Looking at all the scores for one specific cell
res %>%
    filter(condition %in% c(ct_b, ct_nk)) %>%
    arrange(condition, desc(score))
# A tibble: 1,542 × 5
   statistic source condition          score   p_value
   <chr>     <chr>  <chr>              <dbl>     <dbl>
 1 ulm       MYC    AAACCCACAGCTGAAG-1 24.8  3.78e-134
 2 ulm       RFXAP  AAACCCACAGCTGAAG-1 21.7  2.48e-103
 3 ulm       RFXANK AAACCCACAGCTGAAG-1 20.3  2.81e- 91
 4 ulm       CIITA  AAACCCACAGCTGAAG-1 20.2  2.70e- 90
 5 ulm       RFX5   AAACCCACAGCTGAAG-1 18.6  1.29e- 76
 6 ulm       ZBTB4  AAACCCACAGCTGAAG-1 14.2  1.11e- 45
 7 ulm       ZBED1  AAACCCACAGCTGAAG-1 13.0  1.61e- 38
 8 ulm       HIVEP2 AAACCCACAGCTGAAG-1 11.3  1.46e- 29
 9 ulm       RFX1   AAACCCACAGCTGAAG-1 10.2  1.44e- 24
10 ulm       NFKB   AAACCCACAGCTGAAG-1  8.61 7.69e- 18
# ℹ 1,532 more rows

Visualization

We can directly add the ulm scores to an assay in our object and visualize the results

se[['tfulm']] <- res %>%
  pivot_wider(
      id_cols = 'source',
      names_from = 'condition',
      values_from = 'score') %>%
  column_to_rownames('source') %>%
  Seurat::CreateAssayObject(.)

# Change assay
DefaultAssay(object = se) <- "tfulm"

# Scale the data for comparison across signatures 
se <- ScaleData(se)
se@assays$tfulm@data <- se@assays$tfulm@scale.data
UMAP visualization

Plot TF activities on the UMAP embedding

plt <- FeaturePlot(
    se,
    features = c("PAX5", "EBF1", "EOMES"),
    ncol = 2,
    slot = "data") &
    scale_color_viridis_c(option = "magma")

plt + dim_plt

As mentioned in the introduction, TF are usually very lowly expressed and thus the mRNAs suffer from higher droput rates than more highly expressed genes with more mRNA copies. Using TF activities showcases how we can recover their activity by looking at the genes a TF regulates downstream.

# TF activity scores
DefaultAssay(se) <- "tfulm"
plt1 <- FeaturePlot(
    se,
    features = c("PAX5", "EBF1", "EOMES"),
    ncol = 3,
    slot = "data") &
    scale_color_viridis_c(option = "magma")

DefaultAssay(se) <- "RNA"
plt2 <- FeaturePlot(
    se,
    features = c("PAX5", "EBF1", "EOMES"),
    ncol = 3,
    slot = "data") &
    scale_color_viridis_c(option = "magma")

plt1 / plt2

We can see how the difference is night and day. In the top row we see the TF activities while in the bottom the normalized gene expression. In all 3 examples (PAX5, EBF1, EOMES) the transcript is rarely detected while the activity is well recapitulated.

Heatmap by cells

We can also visualize the top 25 most highly variable TF scores using a heatmap

n_tfs <- 25
# Extract activities from object as a long dataframe
df <- t(as.matrix(se@assays$tfulm@counts)) %>%
    as.data.frame() %>%
    mutate(cluster = Idents(se)) %>%
    pivot_longer(cols = -cluster, names_to = "source", values_to = "score")

# Get top tfs with more variable means across **cells**
tfs <- df %>%
    group_by(source) %>%
    summarise(std = sd(score)) %>%
    arrange(-abs(std)) %>%
    head(n_tfs) %>%
    pull(source)

glue::glue("Note that the maximum scaled value is: {round(max(se@assays$tfulm@data[tfs, ]), 2)}, and the minimum is {round(min(se@assays$tfulm@data[tfs, ]), 2)}.")
Note that the maximum scaled value is: 10, and the minimum is -11.16.
DoHeatmap(
    se,
    features = tfs,
    group.by = "Celltype",
    assay = "tfulm",
    disp.max = quantile(se@assays$tfulm@data[tfs, ], .99),
    disp.min = quantile(se@assays$tfulm@data[tfs, ], .01),
    slot = "data") +
    scale_fill_viridis_c(option = "viridis")

Heatmap by groups

From the plot above we can see how we have very distinct populations in our datasets. We can also look at it a bit less granular by looking at the mean activity score per cluster.

n_tfs <- 25
# Extract activities from object as a long dataframe
df <- t(as.matrix(se@assays$tfulm@data)) %>%
    as.data.frame() %>%
    mutate(cluster = se$Celltype) %>%
    pivot_longer(cols = -cluster, names_to = "source", values_to = "score") %>%
    group_by(cluster, source) %>%
    summarise(mean = mean(score))

# Get top tfs with more variable means across **clusters**
tfs <- df %>%
    group_by(source) %>%
    summarise(std = sd(mean)) %>%
    arrange(-abs(std)) %>%
    head(n_tfs) %>%
    pull(source)

# Transform to wide matrix
top_acts_mat <- df %>%
    filter(source %in% tfs) %>%
    pivot_wider(id_cols = 'cluster', names_from = 'source',
              values_from = 'mean') %>%
    column_to_rownames('cluster') %>%
    as.matrix()

# Choose color palette
palette_length <- 100
my_color <- colorRampPalette(c("Darkblue", "white","red"))(palette_length)

# Show which is the max and min of the scaled value to make sure we set a scale that makes sense
glue::glue("Note that the maximum scaled value is: {round(max(top_acts_mat), 2)}, and the minimum is {round(min(top_acts_mat), 2)}.")
Note that the maximum scaled value is: 5.11, and the minimum is -4.99.
my_breaks <- c(seq(quantile(top_acts_mat, .01), 0, length.out=ceiling(palette_length/2) + 1),
               seq(0.05, quantile(top_acts_mat, .99), length.out=floor(palette_length/2)))
# Plot
ComplexHeatmap::pheatmap(top_acts_mat, border_color = NA, color=my_color, breaks = my_breaks)

From this approach we see how RBCs are driving tis visualization. When this happens we can follow another approach in which we show the top activated TF per cluster.

# Get top tfs with more variable means across **clusters**
tfs_df <- df %>%
    group_by(cluster) %>%
    top_n(n = 5, wt = abs(mean))

# We can see how cluster 6 has the highest TF activity scores
# we are going to take this into account when visualizing the data!
tfs_df %>% head()
# A tibble: 6 × 3
# Groups:   cluster [2]
  cluster      source  mean
  <chr>        <chr>  <dbl>
1 B cell, IgG+ CHD4   -2.22
2 B cell, IgG+ EBF1    2.20
3 B cell, IgG+ ELF2    1.94
4 B cell, IgG+ PAX5    1.67
5 B cell, IgG+ PHF5A   2.68
6 B cell, IgG- CHD4   -2.42
# Extract the TF onto a list
tfs <- tfs_df %>%
    pull(source)

# Transform to wide matrix
top_acts_mat <- df %>%
    filter(source %in% tfs) %>%
    pivot_wider(id_cols = 'cluster', names_from = 'source',
              values_from = 'mean') %>%
    column_to_rownames('cluster') %>%
    as.matrix()

# Choose color palette
palette_length <- 100
my_color <- colorRampPalette(c("Darkblue", "white","red"))(palette_length)

# Show which is the max and min of the scaled value to make sure we set a scale that makes sense
glue::glue("Note that the maximum scaled value is: {round(max(top_acts_mat), 2)}, and the minimum is {round(min(top_acts_mat), 2)}.")
Note that the maximum scaled value is: 5.11, and the minimum is -4.99.
my_breaks <- c(seq(quantile(top_acts_mat, .01), 0, length.out=ceiling(palette_length/2) + 1),
               seq(0.05, quantile(top_acts_mat, .99), length.out=floor(palette_length/2)))
# Plot
ComplexHeatmap::pheatmap(top_acts_mat, border_color = NA, color=my_color, breaks = my_breaks)

We can clearly assess that the transcription factors returned are cell type specific since B cells cluster together, and T cells and monocytes as well.

Heatmap by condition & cell type

Lastly we want to check if there are differences in the TF activation depending on cell type and disease

meta_sub <- se@meta.data %>%
    rownames_to_column("bc") %>%
    dplyr::select(bc, Celltype, disease)
df <- t(as.matrix(se@assays$tfulm@data)) %>%
    as.data.frame() %>%
    rownames_to_column("bc") %>%
    left_join(meta_sub, by = "bc") %>%
    column_to_rownames("bc") %>%
    pivot_longer(cols = -c(Celltype, disease), names_to = "source", values_to = "score") %>%
    group_by(Celltype, source, disease) %>%
    summarise(mean = mean(score)) %>%
    mutate(disease_celltype = glue::glue("{disease} - {Celltype}"))

# Get top tfs with more variable means across **clusters**
tfs_df <- df %>%
    group_by(disease_celltype) %>%
    top_n(n = 10, wt = abs(mean))

# We can see how cluster 6 has the highest TF activity scores
# we are going to take this into account when visualizing the data!
# tfs_df %>% data.frame()

# Extract the TF onto a list
tfs <- tfs_df %>%
    pull(source)

# Transform to wide matrix
top_acts_mat <- df %>%
    filter(source %in% tfs) %>%
    pivot_wider(id_cols = 'disease_celltype', names_from = 'source',
              values_from = 'mean') %>%
    column_to_rownames('disease_celltype') %>%
    as.matrix()

# Choose color palette
palette_length <- 100
my_color <- colorRampPalette(c("Darkblue", "white","red"))(palette_length)

# Show which is the max and min of the scaled value to make sure we set a scale that makes sense
glue::glue("Note that the maximum scaled value is: {round(max(top_acts_mat), 2)}, and the minimum is {round(min(top_acts_mat), 2)}.")
Note that the maximum scaled value is: 6.58, and the minimum is -6.27.
my_breaks <- c(seq(quantile(top_acts_mat, .01), 0, length.out=ceiling(palette_length/2) + 1),
               seq(0.05, quantile(top_acts_mat, .99), length.out=floor(palette_length/2)))
# Plot
ComplexHeatmap::pheatmap(top_acts_mat, border_color = NA, color=my_color, breaks = my_breaks)

This heatmap can be overwhelming, we can take a look at specific populations of interest, in this case monocytes and highlight the most highly variable ones:

n_tfs <- 25

# Get top tfs with more variable means across **clusters**
tfs <- df %>%
    # Only keep monocytes
    filter(str_detect(Celltype, "Monocyte")) %>%
    group_by(source) %>%
    summarise(std = sd(mean)) %>%
    arrange(-abs(std)) %>%
    head(n_tfs) %>%
    pull(source)

# Transform to wide matrix
top_acts_mat <- df %>%
    filter(source %in% tfs) %>%
    filter(str_detect(Celltype, "Monocyte")) %>%
    pivot_wider(id_cols = 'disease_celltype', names_from = 'source',
              values_from = 'mean') %>%
    column_to_rownames('disease_celltype') %>%
    as.matrix()

# Choose color palette
palette_length <- 100
my_color <- colorRampPalette(c("Darkblue", "white","red"))(palette_length)

# Show which is the max and min of the scaled value to make sure we set a scale that makes sense
glue::glue("Note that the maximum scaled value is: {round(max(top_acts_mat), 2)}, and the minimum is {round(min(top_acts_mat), 2)}.")
Note that the maximum scaled value is: 2.03, and the minimum is -1.63.
my_breaks <- c(seq(quantile(top_acts_mat, .01), 0, length.out=ceiling(palette_length/2) + 1),
               seq(0.05, quantile(top_acts_mat, .99), length.out=floor(palette_length/2)))
# Plot
ComplexHeatmap::pheatmap(top_acts_mat, border_color = NA, color=my_color, breaks = my_breaks)

In this last scenario we can see how the cell types cluster by major cell type, not by condition. However, in intermediate monocytes there seems to be an enrichment in interferon response genes in those from influenza or covid patients.

Heatmap for gene expression

To fully grasp which genes are driving each gene signature within each cell we want to visualize the gene expression of the genes involved in each gene signature for each cell. We can do so using the ComplexHeatmap package and a little bit of data processing. For ease here is a function you can incorporate in your analysis:

geneHM <- function(
        object,
        sig_df,
        sig_name,
        sig_assay,
        .source,
        .target,
        sig_slot = "data",
        expr_assay = "RNA",
        expr_slot = "data",
        grouping = NULL,
        grouping_color = NULL,
        expr_cols = viridisLite::magma(100)) {
    
    # Extract Gene Expression Matrix from Seurat Object
    gene_expr <- GetAssayData(object, assay = expr_assay, slot = expr_slot)
    
    # Subset the genes of the signature from the Gene Expression Matrix
    genes_of_interest <- sig_df[, .target][which(sig_df[, .source] %in% sig_name)]
    
    # Subset the genes intersecting between gene expression and genes in signature
    g_int <- intersect(rownames(gene_expr), genes_of_interest)
    
    if (length(g_int) < length(genes_of_interest)) {
        genes_excluded <- genes_of_interest[!genes_of_interest %in% rownames(gene_expr)]
        genes_excluded <- paste(genes_excluded, collapse = ", ")
        message(paste0(
            "Genes ", genes_excluded,
            " are in the gene signature but not in the expression matrix,",
            " therefore, they have been excluded."))
    }
    
    # Subset expression matrix to only genes of interest
    gene_expr <- gene_expr[g_int, ]
    
    # Extract the Scores of the Signature of interest
    sig_score <- GetAssayData(object, assay = sig_assay, slot = sig_slot)
    sig_vec <- sig_score[sig_name, ]
    anno <- data.frame(score = sig_vec)
    # Make sure they are in the right order
    anno <- anno[colnames(gene_expr), , drop = FALSE]
    
    # Add any metadata if specified
    if (!is.null(grouping)) {
        meta <- object@meta.data[, grouping, drop = FALSE]
        anno <- cbind(anno, meta[rownames(anno), , drop = FALSE])
    }
    
    if (any(is.infinite(c(anno$score))))
        stop("There are scores with Inf values, please address this outside of this function. It could be because the slot used is scale_data.")
    
    # Make list of color to paint the annotation columns
    if (!is.null(grouping) & !is.null(grouping_color)) {
        score <- circlize::colorRamp2(
            breaks = c(min(anno$score), 0, max(anno$score)),
            colors = c("blue", "white", "red"))
        
        color_ls <- append(grouping_color, score)
        names(color_ls)[length(color_ls)] <- "score"
        
    } else {
        color_ls <- list(
            score = circlize::colorRamp2
            (breaks = c(min(anno$score), 0, max(anno$score)),
                      colors = c("blue", "white", "red")),
            RNA_snn_res.0.15 = clust_color)
    }
    
    # Set the order from most expressing to least expressing 
    ord <- rownames(anno[order(anno$score, decreasing = TRUE), ])
    # Add the score of the signature as annotation in the heatmap
    colAnn <- HeatmapAnnotation(
        df = anno[ord, , drop = FALSE],
        which = 'column',
        col = color_ls)
    
    # Visualize the Heatmap with the genes and signature 
    ht <- ComplexHeatmap::Heatmap(
        as.matrix(gene_expr[, ord]),
        name = "Gene Expression",
        col = expr_cols,
        cluster_rows = TRUE,
        cluster_columns = TRUE,
        column_title = sig_name,
        column_names_gp = gpar(fontsize = 14),
        show_column_names = FALSE,
        top_annotation = colAnn)
    
    # Return ComplexHeatmap
    draw(ht)
}

pal
         B cell, IgG-          B cell, IgG+          CD4, EM-like      CD4, non-EM-like          CD8, EM-like      CD8, non-EM-like    classical Monocyte                    DC intermediate Monocyte               NK cell nonclassical Monocyte              Platelet                   RBC 
            "#000000"             "#005555"             "#55859E"             "#FF91C8"             "#853CAA"             "#0C5ACE"             "#B66DFF"             "#79BCFF"             "#AA91A9"             "#922400"             "#C26000"             "#42E61E"             "#FFFF6D" 
# Visualize the heatmaps for all signatures
# tt <- lapply(unique(sig_df$signature), function(i) {
#     geneHM(
#         object = se,
#         sig_df = sig_df,
#         .source = "signature",
#         .target = "gene",
#         sig_name = i,
#         expr_slot = "data",
#         expr_assay = "RNA",
#         sig_assay = "pathwaysulm",
#         sig_slot = "data",
#         grouping = c("RNA_snn_res.0.15"),
#         grouping_color = list(RNA_snn_res.0.15 = clust_color))
# })

Here are some examples of how to interpret these gene signatures:

  1. Diving deeper into the EBF1 signature we can see how the cells with a high activity score have a high expression of CD79A/B which have a mode of regulation of 1. In turn, those with a negative score have a high expression of GATA3 whose mor is -1, thus indicating that the EBF1 pathway is turned off.
# Subset to 10% of the original dataset for speed and visualization purposes
se_sub <- se[, sample(colnames(se), 0.1 * ncol(se))]

geneHM(
    object = se_sub,
    sig_df = data.frame(net),
    .source = "source",
    .target = "target",
    sig_name = "EBF1",
    expr_slot = "data",
    expr_assay = "RNA",
    sig_assay = "tfulm",
    sig_slot = "data",
    grouping = c("Celltype"),
    grouping_color = list(Celltype = pal))

net %>% dplyr::filter(source == "EBF1" & target %in% c("CD79A", "CD79B", "GATA3"))
# A tibble: 3 × 3
  source target   mor
  <chr>  <chr>  <dbl>
1 EBF1   CD79A      1
2 EBF1   GATA3     -1
3 EBF1   CD79B      1

Briefly, when scoring TF activities and looking at their activities it is important to not only look at the overall score obtained for each cell but one also needs to dive deeper into which are the genes that are driving that signature!

TF by disease

Lastly we can also look at the TF activity per cell type by condition of interest as below:

disease_pal <- c("#41AE76", "#225EA8", "#E31A1C")
names(disease_pal) <- c("normal", "influenza", "COVID-19")
# make a donor palette
donor_pal <- c(
    "#66C2A4", "#41AE76", "#238B45", "#006D2C",
    "#41B6C4", "#1D91C0", "#225EA8", "#253494", "#081D58",
    "#FFEDA0", "#FED976", "#FEB24C", "#FD8D3C",
    "#FC4E2A", "#E31A1C", "#BD0026", "#800026")

names(donor_pal) <- c(
    "Normal 1", "Normal 2", "Normal 3", "Normal 4",
    "Flu 1", "Flu 2", "Flu 3", "Flu 4", "Flu 5",
    "nCoV 1", "nCoV 2", "nCoV 3_4", "nCoV 5",
    "nCoV 6", "nCoV 7_8", "nCoV 9_10", "nCoV 11"  
)

dd <- bind_cols(se@meta.data, t(as.matrix(se@assays$tfulm@counts))) %>%
        tidyr::pivot_longer(
            cols = rownames(se@assays$tfulm@counts),
            names_to = "TF",
            values_to = "score") %>%
        group_by(Celltype, disease, donor_id, TF) %>%
        summarise(median_score = median(score))

# Let's look at some TF of interest
lapply(c("IRF1", "IRF2", "IRF6", "IRF9"), function(i) {
    dd %>%
        filter(TF == i) %>%
        ggplot(aes(x = disease, y = median_score, fill = disease)) +
        geom_boxplot(outlier.shape = NA, alpha = 0.7) +
        geom_jitter(aes(color = donor_id), height = 0) +
        facet_wrap(~ Celltype) +
        scale_fill_manual(values = disease_pal) +
        scale_color_manual(values = donor_pal) +
        scale_x_discrete(limits = names(disease_pal)) +
        theme_classic() +
        labs(title = glue::glue("Factors by condition in {i}"))
})
[[1]]


[[2]]


[[3]]


[[4]]

Extra!

How does a univariate linear model work?

Lets start with a toy example. Imagine a very simple scenario where we have two very simple vectors where one is double the other. We can compute the linear model and also easily visualize the relationship between both vectors:

# Define vectors of interest
vec1 <- c(-1, -0.5, 1, 2, 5)
vec2 <- c(-0.8, -0.7, 1.4, 1.8, 4)

# Run the linear model
summary(lm(vec2 ~ vec1))

Call:
lm(formula = vec2 ~ vec1)

Residuals:
       1        2        3        4        5 
-0.04956 -0.36053  0.50658  0.08465 -0.18114 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.07149    0.19800   0.361  0.74197   
vec1         0.82193    0.07920  10.378  0.00191 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3782 on 3 degrees of freedom
Multiple R-squared:  0.9729,    Adjusted R-squared:  0.9639 
F-statistic: 107.7 on 1 and 3 DF,  p-value: 0.001909
# Visualize the data
(p <- ggplot(mapping = aes(x = vec1, y = vec2)) +
        geom_point() +
        geom_abline(slope = 1, color = "red", linetype = "dashed") + # Passing a slope = 1
        geom_hline(yintercept = 0, size = 0.25) +
        geom_vline(xintercept = 0, size = 0.25) +
        coord_fixed() +
        theme_minimal())

# now we can add the slope of the line that best fits our data and the T value
p +
    stat_poly_line(formula = y ~ x, se = FALSE) +
    stat_poly_eq(use_label(c("eq"))) +
    stat_correlation(use_label(c("t")), label.x = 0.05, label.y = 0.9)

In the example above we see the linear relationship between both vectors and we get the slope and the T value:
- The slope indicates the what is the change in the response variable (vec2) given a 1 unit change in the predictor variable (vec1).

- The T statistic is the result of a T test. The T test assesses the significance of individual coefficients in our model. The T value indicates the number of standard errors the estimated coefficient is away from the null hypothesis (t = 0). Remember the T value is the \(\frac{coefficient}{standard~error}\).

An example of an activated TF activity score would look like this:

# Define vectors of interest
gex <- c(0, 0, 2.5, 0.65, 1.2)
mor <- c(-1, -1, 1, 1, 1)

# Run the linear model
summary(lm(gex ~ mor))

Call:
lm(formula = gex ~ mor)

Residuals:
         1          2          3          4          5 
 1.884e-16 -1.140e-16  1.050e+00 -8.000e-01 -2.500e-01 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    0.725      0.354   2.048    0.133
mor            0.725      0.354   2.048    0.133

Residual standard error: 0.7757 on 3 degrees of freedom
Multiple R-squared:  0.5829,    Adjusted R-squared:  0.4439 
F-statistic: 4.193 on 1 and 3 DF,  p-value: 0.133
# Visualize the data
(p <- ggplot(mapping = aes(x = gex, y = mor)) +
        geom_jitter(width = 0, height = 0.1) +
        geom_abline(slope = 1, color = "red", linetype = "dashed") + # Passing a slope = 1
        geom_hline(yintercept = 0, size = 0.25) +
        geom_vline(xintercept = 0, size = 0.25) +
        coord_fixed() +
        theme_minimal())

# now we can add the slope of the line that best fits our data and the T value
p +
    stat_poly_line(formula = y ~ x, se = FALSE) +
    stat_poly_eq(use_label(c("eq"))) +
    stat_correlation(use_label(c("t")), label.x = 0.05, label.y = 0.9)

Here we see how genes with an mor of -1 (the TF represses the expression of that gene) are not expressed (gex = 0), while those with and mor = 1 (the TF promotes the expression of that gene) are expressed!

Now lets look at a “real world” example, we want to score the B cell signature in one cell. First we are going to start by visualizing the relationship between the weights and the gene expression for 2 cells of interest, one is a B cell and the other is not.

We need to do a bit of data prep but bear with me!

# We have our gene expression matrix
mat <- se@assays$RNA$data

# We want to obtain a matrix with 1s and 0s indicating the weight each gene has for each signature
## Initialize mor_mat with all 0s
sources <- unique(net$source)
targets <- rownames(mat)
mor_mat <- matrix(0, ncol = length(sources), nrow = nrow(mat))
colnames(mor_mat) <- sources
rownames(mor_mat) <- targets
weights <- net$mor

# Fill in the matrix with the weights in the right places
for (i in seq_len(nrow(net))) {
    .source <- net$source[[i]]
    .target <- net$target[[i]]
    .weight <- weights[[i]]
    if (.target %in% targets) {
        mor_mat[[.target, .source]] <- .weight
    }
}
# labels for geom_text_repel
repel_text <- rownames(mat)
pax5_targets <- net %>% filter(source == "PAX5") %>% pull(target)
pax5_targets <- pax5_targets[pax5_targets %in% rownames(mat)]
keep <- which(rownames(mat) %in% pax5_targets)
# Set non-selected positions to NA
repel_text[-keep] <- NA

# Visualize the data
# Note that we are using geom_bin2d which bins the data to show how many genes are found at each location (x, y)
ct_b <- "AAACCCACAGCTGAAG-1"
ct_nk <- "ACCAAACTCGGTGTTA-5"
ggplot(mapping = aes(x = mat[, ct_b], y = mor_mat[, "PAX5", drop = FALSE])) +
    # geom_bin2d(bins = 50) +
    geom_point() +
    ggrepel::geom_text_repel(aes(label = repel_text)) +
    geom_abline(slope = 1, color = "red", linetype = "dashed") + # Passing a slope = 1
    labs(x = "Gene Expression", y = "Gene Weight", title = glue::glue("Cell {ct_b} - B cell")) +
    coord_fixed() +
    scale_fill_gradient(name = "count", trans = "log10") +
    xlim(-0.5, 10) +
    ylim(-2, 2) + # Set the limits from -2 to 2 since the mor values can be negative
    theme_minimal() +
    # now we can add the slope of the line that best fits our data and the T value
    stat_poly_line(formula = x ~ y, se = FALSE) +
    stat_poly_eq(use_label(c("eq"))) +
    stat_correlation(use_label(c("t")), label.x = 0.05, label.y = 0.9)

# Visualize the data
ggplot(mapping = aes(x = mat[, ct_nk], y = mor_mat[, "PAX5", drop = FALSE])) +
    geom_point() +
    ggrepel::geom_text_repel(aes(label = repel_text)) +
    geom_abline(slope = 1, color = "red", linetype = "dashed") + # Passing a slope = 1
    labs(x = "Gene Expression", y = "Gene Weight", title = glue::glue("Cell {ct_nk} - Not a B cell")) +
    coord_fixed() +
    scale_fill_gradient(name = "count", trans = "log10") +
    xlim(-0.5, 10) +
    ylim(-2, 2) +
    theme_minimal() +
    # now we can add the slope of the line that best fits our data and the T value
    stat_poly_line(formula = x ~ y, se = FALSE) +
    stat_poly_eq(use_label(c("eq"))) +
    stat_correlation(use_label(c("t")), label.x = 0.05, label.y = 0.9)

Next we are going to manually run the models for these two cells so that we can see that the results obtained from decoupleR make sense!

Lets run the a linear model to score two cell for the B cell signature

mod1 <- lm(as.matrix(mat[, ct_b, drop = FALSE]) ~ mor_mat[, "PAX5", drop = FALSE])
summary(mod1)

Call:
lm(formula = as.matrix(mat[, ct_b, drop = FALSE]) ~ mor_mat[, 
    "PAX5", drop = FALSE])

Residuals:
    Min      1Q  Median      3Q     Max 
-0.3038 -0.0581 -0.0581 -0.0581  6.6019 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     0.058139   0.001866  31.158  < 2e-16 ***
mor_mat[, "PAX5", drop = FALSE] 0.245627   0.041251   5.954 2.64e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.34 on 33232 degrees of freedom
Multiple R-squared:  0.001066,  Adjusted R-squared:  0.001036 
F-statistic: 35.45 on 1 and 33232 DF,  p-value: 2.636e-09
0.245627 / 0.041251 # This equals the T value
[1] 5.95445
mod2 <- lm(as.matrix(mat[, ct_nk, drop = FALSE]) ~ mor_mat[, "PAX5", drop = FALSE])
summary(mod2)

Call:
lm(formula = as.matrix(mat[, ct_nk, drop = FALSE]) ~ mor_mat[, 
    "PAX5", drop = FALSE])

Residuals:
    Min      1Q  Median      3Q     Max 
-0.0890 -0.0821 -0.0821 -0.0821  6.2809 

Coefficients:
                                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      0.082097   0.002016  40.724   <2e-16 ***
mor_mat[, "PAX5", drop = FALSE] -0.006913   0.044568  -0.155    0.877    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3673 on 33232 degrees of freedom
Multiple R-squared:  7.241e-07, Adjusted R-squared:  -2.937e-05 
F-statistic: 0.02406 on 1 and 33232 DF,  p-value: 0.8767
0.013364 / 0.043381 # This equals the T value
[1] 0.3080611

We can see how mod1 has returned a high coefficient for cell AAACCCACAGCTGAAG-1 while mod2 has returned a low coefficient for cell ACCAAACTCGGTGTTA-5. Moreover, when we look at the T value for the PAX5 we see how in mod1 it is 5.95 while for mod2 it is 0.3080611.

Lets check that the T values obtained manually actually match those returned by decoupleR

res %>% filter(source == "PAX5" & condition %in% c(ct_b, ct_nk))
# A tibble: 2 × 5
  statistic source condition           score       p_value
  <chr>     <chr>  <chr>               <dbl>         <dbl>
1 ulm       PAX5   AAACCCACAGCTGAAG-1  5.95  0.00000000264
2 ulm       PAX5   ACCAAACTCGGTGTTA-5 -0.155 0.877        

Effectively, they do!

Session Info

sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggpmisc_0.5.4-1       ggpp_0.5.4            colorBlindness_0.1.9  ComplexHeatmap_2.16.0 decoupleR_2.9.1       lubridate_1.9.2       forcats_1.0.0         stringr_1.5.1         dplyr_1.1.4           purrr_1.0.2           readr_2.1.4           tidyr_1.3.1           tibble_3.2.1          ggplot2_3.5.0         tidyverse_2.0.0       Seurat_5.0.2          SeuratObject_5.0.1    sp_2.1-3             

loaded via a namespace (and not attached):
  [1] RcppAnnoy_0.0.22       splines_4.3.1          later_1.3.2            cellranger_1.1.0       polyclip_1.10-6        confintr_1.0.2         rpart_4.1.19           fastDummies_1.7.3      lifecycle_1.0.4        doParallel_1.0.17      globals_0.16.2         lattice_0.21-8         vroom_1.6.3            MASS_7.3-60            backports_1.4.1        magrittr_2.0.3         Hmisc_5.1-0            plotly_4.10.4          rmarkdown_2.25         yaml_2.3.8             remotes_2.4.2.1        httpuv_1.6.14          sctransform_0.4.1      spam_2.10-0            spatstat.sparse_3.0-3  reticulate_1.35.0.9000 cowplot_1.1.3          pbapply_1.7-2          RColorBrewer_1.1-3     abind_1.4-5            rvest_1.0.3            Rtsne_0.17             BiocGenerics_0.46.0    nnet_7.3-19            rappdirs_0.3.3         circlize_0.4.15        IRanges_2.34.1         S4Vectors_0.38.1       ggrepel_0.9.5          irlba_2.3.5.1          listenv_0.9.1          spatstat.utils_3.0-4   MatrixModels_0.5-2     goftest_1.2-3          RSpectra_0.16-1        spatstat.random_3.2-2  fitdistrplus_1.1-11    parallelly_1.37.0      leiden_0.4.3.1         codetools_0.2-19       xml2_1.3.5            
 [52] tidyselect_1.2.0       shape_1.4.6            farver_2.1.1           base64enc_0.1-3        matrixStats_1.2.0      stats4_4.3.1           spatstat.explore_3.2-6 jsonlite_1.8.8         GetoptLong_1.0.5       Formula_1.2-5          ellipsis_0.3.2         progressr_0.14.0       ggridges_0.5.6         survival_3.5-7         iterators_1.0.14       foreach_1.5.2          progress_1.2.2         tools_4.3.1            ica_1.0-3              Rcpp_1.0.12            glue_1.7.0             gridExtra_2.3          xfun_0.42              withr_3.0.0            BiocManager_1.30.22    fastmap_1.1.1          fansi_1.0.6            SparseM_1.81           digest_0.6.34          timechange_0.2.0       R6_2.5.1               mime_0.12              gridGraphics_0.5-1     colorspace_2.1-0       Cairo_1.6-1            scattermore_1.2        tensor_1.5             spatstat.data_3.0-4    utf8_1.2.4             generics_0.1.3         data.table_1.15.0      prettyunits_1.1.1      httr_1.4.7             htmlwidgets_1.6.4      uwot_0.1.16            pkgconfig_2.0.3        gtable_0.3.4           lmtest_0.9-40          selectr_0.4-2          OmnipathR_3.8.0        htmltools_0.5.7       
[103] dotCall64_1.1-1        clue_0.3-64            scales_1.3.0           png_0.1-8              knitr_1.45             rstudioapi_0.15.0      tzdb_0.4.0             reshape2_1.4.4         rjson_0.2.21           curl_5.2.0             checkmate_2.2.0        nlme_3.1-163           zoo_1.8-12             GlobalOptions_0.1.2    KernSmooth_2.23-22     parallel_4.3.1         miniUI_0.1.1.1         foreign_0.8-84         logger_0.2.2           pillar_1.9.0           vctrs_0.6.5            RANN_2.6.1             promises_1.2.1         xtable_1.8-4           cluster_2.1.4          htmlTable_2.4.1        evaluate_0.23          magick_2.7.5           cli_3.6.2              compiler_4.3.1         rlang_1.1.3            crayon_1.5.2           future.apply_1.11.1    labeling_0.4.3         plyr_1.8.9             stringi_1.8.3          viridisLite_0.4.2      deldir_2.0-2           munsell_0.5.0          lazyeval_0.2.2         spatstat.geom_3.2-8    quantreg_5.97          Matrix_1.6-5           RcppHNSW_0.6.0         hms_1.1.3              patchwork_1.2.0        bit64_4.0.5            future_1.33.1          shiny_1.8.0            ROCR_1.0-11            igraph_2.0.2          
[154] bit_4.0.5              readxl_1.4.3           polynom_1.4-1         

References

Badia-i-Mompel, Pau, Jesús Vélez Santiago, Jana Braunger, Celina Geiss, Daniel Dimitrov, Sophia Müller-Dott, Petr Taus, et al. 2022. “decoupleR: Ensemble of Computational Methods to Infer Biological Activities from Omics Data.” Edited by Marieke Lydia Kuijjer. Bioinformatics Advances 2 (1). https://doi.org/10.1093/bioadv/vbac016.
Baskar, Reema, Amy F. Chen, Patricia Favaro, Warren Reynolds, Fabian Mueller, Luciene Borges, Sizun Jiang, et al. 2022. “Integrating Transcription-Factor Abundance with Chromatin Accessibility in Human Erythroid Lineage Commitment.” Cell Reports Methods 2 (3): 100188. https://doi.org/10.1016/j.crmeth.2022.100188.
Mereu, Elisabetta, Atefeh Lafzi, Catia Moutinho, Christoph Ziegenhain, Davis J. McCarthy, Adrián Álvarez-Varela, Eduard Batlle, et al. 2020. “Benchmarking Single-Cell RNA-Sequencing Protocols for Cell Atlas Projects.” Nature Biotechnology 38 (6): 747–55. https://doi.org/10.1038/s41587-020-0469-4.
Müller-Dott, Sophia, Eirini Tsirvouli, Miguel Vázquez, Ricardo O. Ramirez Flores, Pau Badia-i-Mompel, Robin Fallegger, Astrid Lægreid, and Julio Saez-Rodriguez. 2023. “Expanding the Coverage of Regulons from High-Confidence Prior Knowledge for Accurate Estimation of Transcription Factor Activities.” http://dx.doi.org/10.1101/2023.03.30.534849.
Nadeau, Samantha, and Gislâine A. Martins. 2022. “Conserved and Unique Functions of Blimp1 in Immune Cells.” Frontiers in Immunology 12 (January). https://doi.org/10.3389/fimmu.2021.805260.